FIGURE 2.2
Given s = 1, QN = 0, QP = 3: A) quantizer output and B) gradient of the quantizer
output with respect to the step size s for LSQ, or with respect to a related parameter
controlling the width of the quantized domain (equal to s(QP + QN)) for QIL [110] and
PACT [43]. The gradient employed by LSQ is sensitive to the distance between v and each
transition point, whereas the gradient employed by QIL [110] is sensitive only to the
distance from the quantizer clip points, and the gradient employed by PACT [43] is zero
everywhere below the clip point. Here, we demonstrate that networks trained with the LSQ
gradient reach higher accuracy than those trained with the QIL or PACT gradients of prior work.
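The behavior plotted in panels A and B can be reproduced with a short sketch, given below in NumPy and assuming the LSQ quantizer v̂ = round(clip(v/s, −QN, QP)) · s introduced earlier in this chapter; the function names are illustrative rather than part of any reference implementation.

```python
import numpy as np

# Illustrative sketch of the figure's setting (s = 1, QN = 0, QP = 3),
# assuming the LSQ quantizer vhat = round(clip(v/s, -QN, QP)) * s.

def lsq_forward(v, s, q_n, q_p):
    """Quantizer output vhat (panel A)."""
    v_bar = np.clip(v / s, -q_n, q_p)
    return np.round(v_bar) * s

def lsq_step_size_grad(v, s, q_n, q_p):
    """Gradient of vhat with respect to s (panel B): equal to the signed
    distance from v/s to its nearest quantization level inside the quantized
    range, and to the clip value (-QN or QP) outside it."""
    ratio = v / s
    grad = np.where((-q_n < ratio) & (ratio < q_p), np.round(ratio) - ratio, 0.0)
    grad = np.where(ratio <= -q_n, -q_n, grad)
    grad = np.where(ratio >= q_p, q_p, grad)
    return grad

v = np.linspace(-1.0, 4.0, 11)
print(lsq_forward(v, s=1.0, q_n=0, q_p=3))         # staircase output of panel A
print(lsq_step_size_grad(v, s=1.0, q_n=0, q_p=3))  # piecewise gradient of panel B
```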
2.2.3 Step Size Gradient Scale
It has been demonstrated that good convergence during training is achieved when the
ratio of average update magnitude to average parameter magnitude is roughly consistent
across all weight layers in a network. Setting the learning rate appropriately prevents
updates that are too large, which cause repeated overshooting of local minima, and updates
that are too small, which lead to slow convergence. Based on this reasoning, each step size
should also have its update magnitude proportional to its parameter magnitude, in the same
way as the weights. Therefore, for a network trained on a loss function L, the ratio
\[
R = \frac{\nabla_s L}{s} \bigg/ \frac{\|\nabla_w L\|}{\|w\|},
\tag{2.11}
\]
should be close to 1, where ∥z∥ denotes the l2-norm of z. However, as precision increases, the
step size parameter is expected to be smaller (due to finer quantization), and the step size
updates are expected to be larger (due to the accumulation of updates from more quantized
items when computing its gradient). To address this, the step size loss is multiplied by a
gradient scale g: for the weight step size, g = 1/√(NW QP), and for the activation step size,
g = 1/√(Nf QP), where NW is the number of weights in a layer and Nf is the number of
features in a layer.
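In practice, this scaling can be folded into the computation graph. The following sketch, assuming a PyTorch-style implementation with an illustrative grad_scale helper and example layer sizes, leaves the forward value of the step size unchanged while scaling its gradient by g:

```python
import torch

# Sketch of the step size gradient scale; helper name and layer sizes are
# illustrative, not part of any particular library.

def grad_scale(s: torch.Tensor, g: float) -> torch.Tensor:
    """Return a tensor equal to s in the forward pass whose gradient with
    respect to s is scaled by g in the backward pass."""
    return (s - s * g).detach() + s * g

n_w = 4608            # N_W: number of weights in the layer (example value)
n_f = 64 * 56 * 56    # N_f: number of activation features in the layer (example value)
q_p = 7               # Q_P of the layer's quantizer

g_weight = 1.0 / (n_w * q_p) ** 0.5   # g = 1 / sqrt(N_W * Q_P) for the weight step size
g_act    = 1.0 / (n_f * q_p) ** 0.5   # g = 1 / sqrt(N_f * Q_P) for the activation step size

s_w = torch.tensor(0.05, requires_grad=True)
s_w_scaled = grad_scale(s_w, g_weight)   # use s_w_scaled inside the weight quantizer;
                                         # gradients reaching s_w are scaled by g_weight
```

The same effect could be obtained by rescaling the step size gradient directly before the optimizer update; the detach-based form above simply folds the scale into the computation graph.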
2.2.4 Training
LSQ trains the model quantizers by making the step sizes learnable parameters, with their
loss gradient computed using the quantizer gradient described earlier, while the other model
parameters are trained with conventional techniques. A common method of training
quantized networks [48] is employed, in which full precision weights are stored and updated,
while quantized weights and activations are used for the forward and backward passes. The
gradient through the quantizer round function is computed using the straight-through
estimator [9], so that
\[
\frac{\partial \hat{v}}{\partial v} =
\begin{cases}
1, & \text{if } -Q_N < v/s < Q_P, \\
0, & \text{otherwise},
\end{cases}
\tag{2.12}
\]
and stochastic gradient descent is used to update parameters.
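The following sketch, assuming PyTorch and an illustrative lsq_quantize helper, puts these pieces together: full precision weights are retained for the update, the round function is applied with the straight-through estimator of Eq. (2.12), and gradients flow to both the weights and the step sizes.

```python
import torch

# Sketch of the training scheme described above; lsq_quantize and the layer
# dimensions are illustrative, not a reference implementation.

def lsq_quantize(v: torch.Tensor, s: torch.Tensor, q_n: int, q_p: int) -> torch.Tensor:
    """Quantize v with learnable step size s. The round is applied with the
    straight-through estimator of Eq. (2.12): its gradient is treated as 1
    inside the clip range and 0 outside."""
    v_bar = torch.clamp(v / s, -q_n, q_p)
    v_q = (v_bar.round() - v_bar).detach() + v_bar   # forward: round; backward: identity
    return v_q * s

# Full precision master weights are stored and updated; quantized values are
# used in the forward and backward passes.
w_fp = torch.randn(16, 8, requires_grad=True)   # full precision weights
s_w  = torch.tensor(0.05, requires_grad=True)   # weight step size (illustrative init)
s_x  = torch.tensor(0.10, requires_grad=True)   # activation step size (illustrative init)

x   = torch.relu(torch.randn(4, 8))             # non-negative activations
w_q = lsq_quantize(w_fp, s_w, q_n=8, q_p=7)     # e.g. 4-bit signed weights
x_q = lsq_quantize(x, s_x, q_n=0, q_p=15)       # e.g. 4-bit unsigned activations

loss = (x_q @ w_q.t()).pow(2).mean()            # placeholder loss for illustration
loss.backward()                                 # gradients reach w_fp, s_w and s_x
# A stochastic gradient descent step then updates w_fp, s_w and s_x directly.
```

Note that only the round operation is detached; the step size therefore still receives the quantizer gradient described earlier through the clip and rescaling operations.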